ABSTRACT

We present our solution to the Yandex Personalized Web
Search Challenge. The aim of this challenge was to use
historical search logs to personalize top-N document rankings
for a set of test users. We used over 100 features extracted
from user- and query-dependent contexts to train neural net
and tree-based learning-to-rank and regression models. Our
final submission, a blend of several different models,
achieved an NDCG@10 of 0.80476 and placed 4th among the
194 teams, winning 3rd prize.^1

1. INTRODUCTION 

Personalized web search has recently been receiving a lot
of attention from the IR community. The traditional
one-ranking-for-all approach to search often fails for ambiguous
queries (e.g. “jaguar”) that can refer to multiple entities. For
such queries, non-personalized search engines typically try
to retrieve a diverse set of results covering as many query
interpretations as possible. This can result in highly
suboptimal search sessions, where the web pages that the user
is looking for are very low in the returned ranking.

In many such cases the user's previous search history can help
resolve the ambiguity and personalize (re-rank) the returned
results to user-specific information needs. Recently, a number
of approaches have shown that search logs can be effectively
mined to learn accurate personalization models, which can
then be deployed to personalize retrieved results in real time.
Many of these models do not require any external information
and obtain all learning signals directly from the search logs.
Such models are particularly attractive since search logs can
be collected at virtually no cost to the search engine, and
most search engines already collect them by default.

To encourage further research in this area, Yandex recently
partnered with Kaggle and organized the Personalized Web
Search Challenge.^2 At the core of this challenge was a
large-scale search log dataset released by Yandex containing over
160M search records. The goal of the challenge was to use
these logs to personalize search results for a selected subset
of test users. In this report we describe our approach to
this problem. The rest of the paper is organized as follows.
Section 2 describes the challenge data and task in detail.
Section 3 introduces our approach in three stages: (1) data
partitioning, (2) feature extraction and (3) model training.
Section 4 concludes with results.

^1 Top team “pampampampam” was from Yandex and did not
officially participate in the competition.

^2 www.kaggle.com/c/yandex-personalized-web-search-challenge

Figure 1: Final leaderboard standings. Our team
“learner” placed 4th among the 194 teams (261
users) that participated in this challenge.

2. CHALLENGE DESCRIPTION 

In this challenge Yandex provided a month’s (30 days)
worth of search engine logs for a set of users
$U = \{u_1, ..., u_N\}$. Each user $u$ engaged with the search
engine by issuing queries $Q_u = \{q_{u1}, ..., q_{uM_u}\}$. Queries
that were issued “close” to each other in time were grouped
into sessions. For each query $q_u$ the search engine retrieved
a ranked list of web pages (documents)
$D_{q_u} = \{d_{q_u 1}, ..., d_{q_u K_{q_u}}\}$, returning it to the user.
The user then scanned this list, possibly clicking on some
documents. Every such click is recorded in the logs together
with the time stamp and the id of the document that was
clicked. Only the top ten documents and their clicks (if any)
were released for each query, so $K_{q_u} = 10 \;\forall q_u$. For
privacy reasons, very little information about queries and
documents was provided. For queries, only the numeric query
id and numeric query-term ids were released. Similarly, for
documents, only the numeric document id and the corresponding
domain id (e.g. facebook.com for Facebook pages) were
released.

Clicks combined with dwell time (time spent on a page)
can provide a good indication of document relevance to the
user. In particular, it has been consistently found that
longer dwell times strongly correlate with high relevance,
leading to the concept of satisfied (SAT) clicks: clicks with
dwell time longer than a predefined threshold (for example,
30 seconds). Most existing personalization frameworks
assume that documents with SAT clicks are relevant
and use them to train/evaluate models.

Table 1: Dataset statistics

Unique queries                   21,073,569
Unique documents                 70,348,426
Unique users                      5,736,333
Training sessions                34,573,630
Test sessions                       797,867
Clicks in the training data      64,693,054
Total records in the log        167,413,039

Table 2: Document relevance distribution for training
and validation sets.

                 Training     Validation
no click        5,673,937      1,993,602
relevance 0       115,713         54,572
relevance 1       206,658        196,290
relevance 2       728,662        149,536

This competition adopted a similar evaluation framework
where each document was assigned one of three relevance
labels depending on whether it was clicked and on the click
dwell time. For privacy reasons dwell time was converted
into anonymous “time units” and relevance labels were
assigned according to the following criteria:

• relevance 0: documents with no clicks or dwell time
strictly less than 50 time units

• relevance 1: documents with clicks and dwell time
between 50 and 399 time units

• relevance 2: documents with clicks and dwell time of
at least 400 time units as well as documents with last
click in session

Using the above criteria, a set of relevance labels
$L_{q_u} = \{l_{q_u 1}, ..., l_{q_u K_{q_u}}\}$ (one per document) can be
generated for every issued query. Note that these relevance
labels are personalized to the user who issued the query and
express his/her preference over the returned documents. Given
the relevance labels, the aim of the challenge was to develop a
personalization model that accurately re-ranks the
documents in the order of relevance to the user who issued
the query.

To ensure fair evaluation the data was partitioned into
training and test sets. Training data consisted of all queries
issued in the first 27 days of search activity. Test data
consisted of queries sampled from the next 3 days of search
activity. To generate the test data, one query with at least
one relevant (relevance > 0) document was sampled from each
of 797,867 users, resulting in a fairly large test set with
almost 800K queries and 8M documents. In order to simulate a
real-time search personalization scenario, all search activity
after each test query was removed from the data. Furthermore, to
encourage discovery of medium and long term search preference
correlations, all sessions except those that contained
test queries were removed from the 3 day test period. A
diagram illustrating data partitioning is shown in Figure 2
and full dataset statistics are shown in Table 1.

All submissions were required to provide full document
rankings for each of the 797,867 test queries and were
evaluated using the Normalized Discounted Cumulative Gain
(NDCG) objective. Given a test query $q_u$ with documents
$D_{q_u}$ and relevance labels $L_{q_u}$, NDCG is defined by:

$$\mathrm{NDCG}(\pi, L_{q_u})@T = \frac{1}{G_T(L)} \sum_{i=1}^{T} \frac{2^{L(\pi^{-1}(i))} - 1}{\log_2(i + 1)} \qquad (1)$$

Here $\pi : D_{q_u} \to \{1, ..., K_{q_u}\}$ is a ranking produced by the
model mapping each document $d_{q_u j}$ to its rank $\pi(j) = i$,
and $j = \pi^{-1}(i)$. $L(\pi^{-1}(i))$ is the relevance label of the
document in position $i$ in $\pi$, and $G_T(L)$ is a normalizing
constant. Finally, $T$ is a truncation constant which was set
to 10 in this challenge.

As commonly done in data mining challenges, test relevance
labels were not released to the participants and all
submissions were internally evaluated by Kaggle. Average
NDCG@10 accuracies for approximately 50% of test queries
were made visible throughout the challenge on the “public”
leaderboard, while the other 50% were used to calculate the
final standings (“private” leaderboard).
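The relevance criteria and Equation (1) can be illustrated with a minimal Python sketch. The function names and arguments are hypothetical, and the sketch assumes that the normalizing constant $G_T(L)$ is the DCG of the ideal (label-sorted) ranking, which is the usual convention but is not spelled out in the text.

```python
import math

def relevance_label(clicked, dwell_time=0, last_click_in_session=False):
    """Map a click record to a relevance label in {0, 1, 2} (hypothetical helper)."""
    if not clicked:
        return 0
    if last_click_in_session or dwell_time >= 400:
        return 2
    if dwell_time >= 50:
        return 1
    return 0  # clicked, but dwell time strictly below 50 time units

def dcg_at_t(ranked_labels, t=10):
    """Sum of (2^l - 1) / log2(i + 1) over the top-t positions i = 1..t."""
    return sum((2 ** label - 1) / math.log2(i + 1)
               for i, label in enumerate(ranked_labels[:t], start=1))

def ndcg_at_t(ranked_labels, t=10):
    """NDCG@t for labels listed in the order produced by the model.

    Assumption: G_T(L) is taken to be the DCG of the ideal ordering.
    """
    ideal = dcg_at_t(sorted(ranked_labels, reverse=True), t)
    return dcg_at_t(ranked_labels, t) / ideal if ideal > 0 else 0.0
```

For example, a list whose only relevant document (label 2) sits in position 3 gets a DCG of 3 / log2(4) = 1.5 against an ideal DCG of 3, so its NDCG is 0.5.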


3. OUR APPROACH 

In this section we describe our approach to this challenge.
Before developing our models we surveyed existing work in
this area and found that most personalization methods can
be divided into three categories: heuristic, feature-based and
user-based. Heuristic methods use search logs to compute a
user-specific document statistic, such as the number of
historical clicks, and then use this statistic to re-rank the
documents. Since it is often difficult to know which statistic
will work best, feature-based models extract a diverse set of
features that are used as input to machine learning
methods, which automatically learn personalization models.
Note that while features are extracted separately for every
user-query-document triplet, the same model is used to
re-rank documents for all users.

Finally, user-based methods [15, 20], as the name suggests,
learn separate models for each user. Some of these
models use collaborative filtering techniques to infer latent
factors for users and documents, while others
adapt learning-to-rank models by incorporating user-specific
weights and biases.

User-based models allow the highest level of personalization
but require extensive user search history and/or side
information about queries and documents (such as topics,
document features etc.). Given the sparsity of our
data (70M unique documents in 160M records) and the lack
of user/query/document information, we opted to use the
feature-based approach. In the following sections we
describe in detail all the components that were necessary to
create a feature-based model, namely data partitioning,
feature extraction and learning/inference algorithms.


3.1 Dataset Partitioning 

We begin by describing our data partitioning strategy.
Properly selected training/validation datasets are crucial to
the success of any data mining model. Ideally we want these
datasets to have very similar properties to the test data. To
achieve this we carefully followed the query sampling procedure
(described in Section 2) used by competition administrators
to select test queries.

Figure 2: Diagram showing data partition and training/validation/test query selection (in red) for a single
user. Training query was always selected to be the last query in training period with at least one relevant
(relevance > 0) document. Similarly, validation query was always selected to be the last query in test session
with at least one relevant document. Test query was given a priori and was the last query in test session.

For each user we first sorted all sessions by day (lowest
to highest), randomly resolving ties since exact timestamps
were not available. We then selected the last query in the 27
day training period with at least one relevant (relevance >
0) document as the training query. Similarly, the last query in
the test session with at least one relevant document was
selected for validation. This selection process is shown in
Figure 2.
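A minimal sketch of this selection step for a single user is given below. The `Query` record and its fields are hypothetical (the released logs use a different layout), and the queries are assumed to be already sorted chronologically. The test query itself carries no labels, so it is never picked.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Query:
    day: int                                  # day index in the 30 day log (1..30)
    labels: List[int] = field(default_factory=list)  # relevance labels; empty if unknown
    in_test_session: bool = False             # True for queries in the user's test session

def has_relevant(query: Query) -> bool:
    """True if at least one returned document has relevance > 0."""
    return any(label > 0 for label in query.labels)

def select_queries(queries: List[Query]) -> Tuple[Optional[int], Optional[int]]:
    """Return (training index, validation index) for one user's sorted queries.

    The last qualifying query in the 27 day training period becomes the training
    query; the last qualifying query in the test session becomes the validation
    query. Either index is None when the user has no qualifying query.
    """
    train_idx, valid_idx = None, None
    for i, query in enumerate(queries):
        if not has_relevant(query):
            continue
        if query.in_test_session:
            valid_idx = i
        elif query.day <= 27:
            train_idx = i
    return train_idx, valid_idx
```

Users for which either index is `None` correspond to the users "without enough data" that are dropped from the respective set.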

The motivation behind choosing these specific queries was
threefold. First, since features can only be extracted from
queries issued before the given query, we need to choose
queries with as much historical data as possible. Selecting
queries at the end of the training period and test session
ensures maximum historical data. Second, there could be a
large time gap between the end of the training period and the
test session, and during that time the user’s search needs and
preferences could change significantly. To capture this we
need both training and validation queries to be as close as
possible to the test ones. However, since many test sessions did
not have enough data to select two queries, only the validation
query was sampled from this session. Finally, only selecting
queries with at least one relevant document ensures that
there is sufficient training signal for learning-to-rank models.
Training objectives in these models are often order-based
and thus require at least one relevant document.

Applying this procedure to each of the 797,867 test users
and removing users that did not have enough data resulted
in 672,497 training and 239,400 validation queries. Once
the data was partitioned, relevance labels were computed for
all documents in both training and validation queries using
the criteria outlined in Section 2. Table 2 shows the relevance
label distribution across documents in both sets. From this
table we see that the majority of documents with clicks have
relevance label 1 or 2. This suggests that once the user clicks
on a document (s)he tends to spend a “significant” amount of
time going through the content of that document. It can
also be seen that the validation relevance distribution is similar
to the training one, with the exception that the training data has
considerably more highly relevant (relevance 2) documents.


3.2 Feature Extraction 

After partitioning the data and computing relevance labels
we proceeded to feature extraction. Our aim was to
extract features for every training, validation and test
user-query-document triplet $(u, q_u, d_{q_u})$. As mentioned above,
the available log data provided very little information about
individual queries and retrieved documents. For queries, we
only had access to term vectors with individual terms
converted to numeric ids. Similarly, for documents we only had
access to their domain ids and the base ranking generated by
the Yandex search engine. In this form the personalization
problem is similar to collaborative filtering/ranking, where
very little information about items and users is typically
available. Neighborhood-based models that extract features
from similar items/users have been shown to consistently
perform well in these problems and were an essential part
of the Netflix prize winning solution. In search
personalization, ranking models learned on features extracted
from the user’s search neighborhoods (historical sessions, queries
etc.) have also recently been shown to perform well.
Inspired by these results we concentrated our efforts on
designing features using historical search information in the
logs.

We began by identifying several “contexts” of interest.
Here, contexts are analogous to user/item neighborhoods
in collaborative filtering, and contain collections of queries
that have some relation to the target user-query-document
triplet for which the features are being extracted. Formally
we define context as:

Definition 1.

Context $C = \{\{q_1, ..., q_M\}, \{D_{q_1}, ..., D_{q_M}\}, \{L_{q_1}, ..., L_{q_M}\}\}$
is a set of queries with corresponding document and
relevance label lists.

Given a user-query-document triplet $(u, q_u, d_{q_u})$, we
primarily investigated two context types: user-related and
query-related. For user-related contexts we considered all
queries issued by $u$ before $q_u$ and partitioned them into 2
contexts: repetitions of $q_u$ and everything else. The
rationale behind this partitioning is that past instances of $q_u$
are particularly useful for inferring the user’s search interests
for $q_u$, and should be processed separately. In addition
to historical queries from $u$, we computed a context from all
instances of $q_u$ issued by users other than $u$. This context
provides global information on user preferences for documents
returned for $q_u$, and can be useful when little information
from $u$ is available.

For each of these contexts we computed features on both
document and domain levels. To use domains we simply
substituted $d_{q_u}$ with its domain and replaced document lists in
each context with domain lists. Given that multiple documents
can have the same domain we expect domain features
to be less precise. However, domain data is considerably less
sparse (~70M unique documents vs ~5M unique domains)
and can thus provide greater coverage. Using both document
and domain lists we ended with a total of 6 contexts:

• $C_1$: all repetitions of $q_u$ by $u$

• $C_2$: same as $C_1$ but with domain lists

• $C_3$: all queries other than $q_u$ issued by $u$

• $C_4$: same as $C_3$ but with domain lists

• $C_5$: all repetitions of $q_u$ by users other than $u$

• $C_6$: same as $C_5$ but with domain lists
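The six contexts can be assembled with a single pass over the available log queries, as the following sketch suggests. The `LogQuery` record and the `build_contexts` helper are hypothetical names introduced for illustration; the caller is assumed to pass only the queries that may legitimately be used for the target triplet (for example, queries by $u$ issued before $q_u$).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class LogQuery:
    user: int
    query_id: int
    docs: List[int]      # ranked document ids
    domains: List[int]   # domain id of each document, same order
    labels: List[int]    # relevance label of each document, same order

Context = List[Tuple[List[int], List[int]]]  # (item list, label list) per query

def build_contexts(user: int, query_id: int, log: List[LogQuery]) -> Dict[str, Context]:
    """Split log queries into the contexts C1..C6 for the target (user, query)."""
    contexts: Dict[str, Context] = {f"C{i}": [] for i in range(1, 7)}
    for q in log:
        same_user = q.user == user
        same_query = q.query_id == query_id
        if same_user and same_query:
            doc_key, dom_key = "C1", "C2"      # repetitions of q_u by u
        elif same_user:
            doc_key, dom_key = "C3", "C4"      # other queries by u
        elif same_query:
            doc_key, dom_key = "C5", "C6"      # q_u issued by other users
        else:
            continue                           # unrelated query, not in any context
        contexts[doc_key].append((q.docs, q.labels))
        contexts[dom_key].append((q.domains, q.labels))
    return contexts
```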

In this form our contexts are similar to the “views” explored in
previous work. The main difference between the two is that views
are user-specific whereas contexts can include any set of queries,
including those from other users. Note that we also do not
apply any session-based partitioning within the contexts and
all queries are simply aggregated together. Throughout the
challenge we experimented with several session-related contexts
(current session vs past sessions) but did not find them
to give significant improvement.

After specifying the contexts we defined a total of 20
context-dependent features, described in detail in Appendix A.
Most of these features aim to capture how frequently
$d_{q_u}$ was shown/clicked/skipped/missed in the given context.
The features also try to account for the rank position of $d_{q_u}$
across the context and the similarity between $q_u$ and context
queries. Query similarity features (see Appendix A)
are only relevant when queries other than $q_u$ are included
in the context, and are thus only extracted for contexts $C_3$
and $C_4$. All together, we computed 20 features for $C_1$, $C_2$,
$C_5$, $C_6$ and 16 features for $C_3$, $C_4$, giving us a total of 112
context features. In addition to these features, we added the
rank of $d_{q_u}$ returned by the search engine as the 113th and
final feature.

All of the 20 context features only require simple operations
and are straightforward to implement. Similarly, contexts
$C_1$ - $C_4$ are readily available in the log data and can be
easily extracted. Contexts $C_5$ and $C_6$, on the other hand, are
trickier to compute efficiently since they require access to all
instances of a particular query. To calculate these we created
an inverted hash map index mapping each unique query id
to a table storing all occurrences of this query id in the logs
with the corresponding document, domain and relevance label
lists. For any query, a single lookup in this index was then
required to compute features for every document returned for
that query. The full feature extraction for training, validation
and test queries (~1.7M queries with 17M documents),
implemented in Matlab, took roughly 7 hours on a Thinkpad
W530 laptop with an Intel i7-3720QM 2.6 GHz processor and
32GB of RAM.
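The inverted index idea can be sketched in a few lines of Python (the original implementation was in Matlab and is not shown in the paper; the tuple layout and function names below are assumptions made for illustration).

```python
from collections import defaultdict

def build_query_index(log):
    """Map each query id to the list of all its occurrences in the log.

    `log` is an iterable of (user, query_id, docs, domains, labels) tuples.
    """
    index = defaultdict(list)
    for user, query_id, docs, domains, labels in log:
        index[query_id].append((user, docs, domains, labels))
    return index

def occurrences_by_other_users(index, query_id, user):
    """Single lookup returning every instance of `query_id` not issued by `user`."""
    return [occ for occ in index.get(query_id, []) if occ[0] != user]
```

Building the index is a single pass over the log, after which the occurrences needed for contexts $C_5$ and $C_6$ are retrieved with one dictionary lookup per query instead of a scan over all records.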


3.3 Learning and Inference 

We trained several learning-to-rank and regression models
on the extracted feature data. For learning-to-rank models
we used RankNet, ListNet and a variation of
BoltzRank. Given the success of tree-based generalized
gradient boosting machines (GBMs) on recent IR benchmarks
such as Yahoo!’s Learning To Rank challenge,
we also experimented with the state-of-the-art GBM
learning-to-rank model LambdaMART. We omit the details of
each model in this report and refer the reader to the respective
papers for detailed descriptions.

For the pairwise RankNet model we experimented with various
ways to extract pairwise preferences from click data.
Specifically, many studies have shown that users scan returned
results from top to bottom [14], so documents
ranked below the bottom-most click were likely missed by
the user. It is thus unclear whether we should use those
documents during training and, if so, what relevance they
should be assigned. Skipped documents (i.e. those above the
bottom-most click), on the other hand, were clearly found
not relevant by the user. However, it is also unclear whether
they should be assigned the same relevance label 0 that is
given to clicked documents with low dwell time. Intuitively,
a click seems to be a stronger preference signal than a skip,
even if the dwell time after that click is low.

To validate these hypotheses, we used a 1-hidden layer
neural net implementation of RankNet and trained it on
different preference targets extracted from clicks. We
experimented with several variations of the cascade click model
as well as various relevance re-weightings. Across these
experiments the best results were obtained by simply setting
the relevance of skipped and missed documents to zero and
training on all the available data. These results, although
somewhat surprising, can possibly be explained by the fact
that this assignment matches the target used in NDCG
for model evaluation. In light of these results we used the
{0, 1, 2} relevance assignment in all subsequent experiments.
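To make the role of the {0, 1, 2} labels concrete, the sketch below computes the standard RankNet pairwise cross-entropy loss for one query, treating every pair with different labels as a preference. This is an illustration of the generic RankNet objective, not the authors' code; the function name and the per-pair averaging are assumptions.

```python
import math

def ranknet_query_loss(scores, labels):
    """Average pairwise logistic loss over all pairs (i, j) with labels[i] > labels[j].

    Each such pair contributes log(1 + exp(-(s_i - s_j))), which is small when
    the more relevant document i is scored above document j.
    """
    loss, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += math.log1p(math.exp(-(scores[i] - scores[j])))
                n_pairs += 1
    return loss / n_pairs if n_pairs else 0.0
```

Under this assignment skipped and missed documents simply enter the pairs as label-0 items, so a query with no relevant document yields no pairs and no training signal.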


4. RESULTS

In this section we describe the main results achieved by
our models. Throughout the experiments we consistently
found that performance (gains/losses) on our in-house
validation set closely matched the public leaderboard. At the
end of the competition we also saw that public and private
leaderboard results were very consistent. In this report we
thus concentrate on private leaderboard NDCG scores since
these scores were used to compute the final standings. We
note that these results were only available after the
competition ended, so it was impossible to directly optimize the
models for this set.

At the beginning of the competition, before applying
sophisticated machine learning methods, we created a simple
heuristic-based model that re-ranked documents based on
their total historical relevance. Specifically, for every test
document $d_{q_u}$ we computed feature $g_1$ (see Appendix A)
using all previous instances of $q_u$ issued by $u$ (context $C_1$).
We then re-ranked documents by $g_1$, using the original
ranking to resolve ties.^3 This model produced an NDCG@10
of 0.79754, shown in Table 3 (“re-rank by hist relevance”),
which is an absolute improvement of 0.0062 over the baseline
non-personalized ranking produced by Yandex’s search
engine. This submission would have placed 32nd on the final
leaderboard.

^3 We also experimented with features $g_2$ - $g_4$ but found $g_1$ to
work best.

Table 3: Private leaderboard average NDCG@10 results.
Only results for the best model of each type
are shown.

Model                         NDCG@10
default ranking baseline      0.79133
re-rank by hist relevance     0.79754
regression (NN)               0.80315
learning-to-rank (NN)         0.80324
LambdaMART                    0.80330
aggregate average             0.80378
aggregate RankNet             0.80476
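The heuristic baseline amounts to a stable sort of the returned list by historical relevance. A minimal sketch, assuming the per-document $g_1$ values have already been computed and are passed in as a dictionary (a hypothetical interface):

```python
def rerank_by_history(docs, g1):
    """Re-rank `docs` (in original engine order) by descending g1.

    Documents missing from `g1` get a value of 0; ties are resolved by the
    original rank, so a list with no history is returned unchanged.
    """
    order = sorted(range(len(docs)), key=lambda i: (-g1.get(docs[i], 0), i))
    return [docs[i] for i in order]
```

For instance, with `docs = [10, 11, 12, 13]` and `g1 = {11: 4, 12: 4, 13: 1}`, documents 11 and 12 move to the top in their original relative order, followed by 13 and then 10.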

After verifying that personalization from logs is possible,
we proceeded to learning-to-rank and regression models. We
trained 1-hidden layer neural net implementations of each
model using tanh activation units and varying the number
of hidden units in the [10, 200] range. Regression models
were optimized with a squared-loss objective function. Before
learning, all features were standardized to have mean 0 and
standard deviation 1. For each model we used mini-batch
learning with a batch size of 100 queries (1000 documents),
processing each query in parallel. Parallel processing allowed
us to fully train these models on all of the available training
data in several hours using the same Thinkpad W530
machine.

Results for the best neural net (NN) regression and
learning-to-rank models are shown in Table 3. From the table we
see that both models significantly improve NDCG@10, with
absolute gains of up to 0.0119 over the baseline ranking
(0.80324 vs 0.79133). We also see that regression models
perform similarly to learning-to-rank ones, with
learning-to-rank only providing marginal
gains. For both types of models we found that neural nets
with 50 - 100 hidden units performed the best. Moreover, for
learning-to-rank we found that RankNet performed slightly
better than other ranking models but the difference was not
significant (less than 0.0001).

The best result for LambdaMART is also shown in Table 3.
We used the publicly available RankLib library to run
LambdaMART. Training LambdaMART took a very long time
(on the order of days) and used close to 25GB of RAM.
We were thus unable to properly validate/tune all the
hyperparameters, such as the number of leaves and the learning
rate. This possibly explains the marginal performance of
this model, as seen from Table 3, where it performs
comparably to the neural net models.

4.1 Model Aggregation 

For each experiment that we ran throughout the competition
we saved the models that performed best on the validation
set. This gave us ~30 trained models at the end of the
competition. It is well known that blending improves on the
accuracy of individual models, and blended solutions have won
many data mining competitions including the Netflix
challenge. Keeping this in mind we spent the last few days
of the competition finding the best blend of the models that
we had trained.

Before applying any blending techniques we standardized
the scores produced by each model to have mean 0 and
standard deviation 1. After normalization we began with a
simple baseline that averaged all the available scores. This
baseline obtained an NDCG@10 of 0.80378 and is shown
in Table 3 (“aggregate average”). While this is an
improvement over the best individual model (0.80330), the
improvement is not significant. This can be attributed to the
fact that many models in our blending set were considerably
weaker than the best model. Consequently, including all of
these models in the blend with equal weight limited the
overall accuracy. It is thus evident that with many weaker
models simple averaging is not optimal and more adaptive
techniques are necessary.
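The averaging baseline can be sketched as follows. The function names are hypothetical, and the sketch assumes that each model's scores are standardized over whatever set of documents is passed in; the text does not say whether normalization was done globally or per query.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Shift and scale one model's scores to mean 0 and standard deviation 1."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]   # constant scores carry no ranking information
    return [(s - mu) / sigma for s in scores]

def average_blend(model_scores):
    """Equal-weight blend: `model_scores[m][d]` is model m's score for document d."""
    standardized = [standardize(scores) for scores in model_scores]
    return [mean(column) for column in zip(*standardized)]
```

Standardizing first keeps a model with a large score range from dominating the average, but every model still receives the same weight regardless of its accuracy.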

One possible solution is to use model-specific weights during
aggregation. Weights are typically chosen to be a function
of the model’s accuracy, and several such functions
have been suggested in the literature. However, instead of
tuning these weights by hand, a more principled and
potentially more accurate approach is to apply one of the
learning-to-rank methods to automatically learn the weights.

We experimented with this approach and began by
partitioning our validation^4 set into two subsets. One subset
was then used to train a linear RankNet on the score outputs of
all models in the aggregating set, and the other subset was
used for validation. The result for this model is shown at
the bottom of Table 3 (“aggregate RankNet”). It produced
an NDCG@10 of 0.80476 and was our best submission in
this competition, placing 4th on the private leaderboard.
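A linear RankNet over model outputs reduces to learning one weight per model with the pairwise logistic loss. The sketch below uses plain batch gradient descent per query; the function name, learning rate, epoch count and zero initialization are illustrative assumptions, not settings reported in the paper.

```python
import math

def train_linear_ranknet(queries, n_models, lr=0.1, epochs=50):
    """Learn blending weights w so that s(d) = sum_k w[k] * x_d[k].

    `queries` is a list of (X, labels) pairs, where X[d] is the vector of
    standardized model scores for document d and labels[d] is its relevance.
    """
    w = [0.0] * n_models
    for _ in range(epochs):
        for X, labels in queries:
            s = [sum(wk * xk for wk, xk in zip(w, x)) for x in X]
            grad = [0.0] * n_models
            for i in range(len(X)):
                for j in range(len(X)):
                    if labels[i] > labels[j]:
                        # d/dw log(1 + exp(-(s_i - s_j))) = -sigmoid(s_j - s_i) * (x_i - x_j)
                        coeff = -1.0 / (1.0 + math.exp(s[i] - s[j]))
                        for k in range(n_models):
                            grad[k] += coeff * (X[i][k] - X[j][k])
            w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w
```

On a toy query where the first model orders the documents correctly and the second does not, the learned weight for the first model is positive and the weight for the second is negative, which is exactly the adaptivity that equal-weight averaging lacks.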

4.2 Analysis of Results 

To analyze the effect of personalization we computed
Kendall τ correlations between rankings produced by our
best model and the non-personalized baseline rankings from
Yandex. The plot for 50K randomly chosen validation
queries is shown in Figure 3(a). From this figure we see that
for most queries τ is above 0.7, indicating that our model
is fairly conservative and tends to re-rank only a few
documents in the list. However, we also see that a number of
queries are very aggressively re-ranked, with τ below 0.5.

While aggressive personalization can significantly improve
the user search experience, it can also lead to dangerous outlier
queries where the top-N documents are ranked completely out
of order. This is further illustrated in Figure 3(b), which
shows the difference in NDCG@10 between our model and
Yandex’s base ranking for the same 50K queries. From this
figure we see that while the personalized model improves NDCG
for many queries, some queries are also significantly hurt,
with NDCG drops of over 0.4. This further demonstrates
the danger of applying personalization to all queries and
emphasizes the need for adaptive strategies that selectively
choose which queries should be re-ranked. Moreover, risk
minimization (largest NDCG loss across all queries) might
be a more appropriate objective for this task since it can
produce models with more stable worst-case performance.
This, however, is beyond the scope of this paper and we
leave it for future research.
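For two rankings of the same ten documents, Kendall τ can be computed directly from concordant and discordant pairs. The sketch below assumes rankings without ties and is an illustration only; the function name is hypothetical.

```python
def kendall_tau(ranking_a, ranking_b):
    """Kendall tau between two orderings of the same document ids (no ties).

    Returns 1.0 for identical rankings and -1.0 for exactly reversed ones.
    """
    n = len(ranking_a)
    if n < 2:
        return 1.0                      # a single item cannot be re-ordered
    position_b = {doc: i for i, doc in enumerate(ranking_b)}
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if position_b[ranking_a[i]] < position_b[ranking_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

With ten documents there are 45 pairs, so promoting a single document from the last position to the first flips 9 pairs and already lowers τ to (36 - 9) / 45 = 0.6.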

^4 Note that the training set should not be used for aggregation
since individual models could have overfitted on it.

(a) (b)

Figure 3: Figure 3(b) shows the NDCG@10 difference between our best personalized model and the static ranking
produced by Yandex for 50K validation queries. Figure 3(a) shows the Kendall τ distance histogram for the same
50K queries. Kendall τ is computed between the personalized and non-personalized ranking for each query.

5. CONCLUSION AND FUTURE WORK

In this paper we presented our solution to the Yandex
Personalized Web Search Challenge. In our approach search
logs were first partitioned into user- and query-dependent
neighborhoods (contexts). Query-document features were
then extracted from each context, summarizing document
preference within the context. Models trained on these
features achieved significant improvements in accuracy over
the non-personalized ranker.

In future work we plan to explore contexts based on
similar queries/users. Such contexts have been successfully
applied in neighborhood-based collaborative filtering models
and can potentially be very useful in this domain as well.
Both user and query similarities can be readily inferred from
the search logs using statistics like issued query overlap for
users and document/domain overlap for queries. These
contexts can be particularly useful for personalization of
long-tail queries that occur very infrequently in the data and do
not have enough preference data.

APPENDIX

A. CONTEXT FEATURES 

Given user-query-document triplet $(u, q_u, d_{q_u})$ and context
$C$ we extract a total of 20 context-dependent features
$g_1$ - $g_{20}$ (all missing features are set to 0):

• Total relevance for all clicks on $d_{q_u}$ in $C$:

$$g_1 = \sum_{q \in C} \sum_{d_q \in D_q} I[d_q = d_{q_u}] \, l_q$$

where $I[x]$ is an indicator function evaluating to 1 if $x$
is true and 0 otherwise, and $l_q$ denotes the relevance label
assigned to $d_q$

• Average relevance for all clicks on $d_{q_u}$ in $C$:

$$g_2 = \frac{1}{\sum_{q \in C} \sum_{d_q \in D_q} I[d_q = d_{q_u}]} \sum_{q \in C} \sum_{d_q \in D_q} I[d_q = d_{q_u}] \, l_q$$


• Max/min relevance across all clicks on $d_{q_u}$ in $C$:

$$g_3 = \operatorname{argmax}\{l_q \mid q \in C, d_q \in D_q, d_q = d_{q_u}\}$$

$$g_4 = \operatorname{argmin}\{l_q \mid q \in C, d_q \in D_q, d_q = d_{q_u}\}$$


• Average similarity between $q_u$ and all queries in $C$
where $d_{q_u}$ was clicked:

$$g_5 = \frac{1}{\sum_{q \in C} \mathrm{clicked}(d_{q_u}, D_q)} \sum_{q \in C} \mathrm{clicked}(d_{q_u}, D_q) \, \mathrm{sim}(q, q_u)$$

where $\mathrm{clicked}(d_{q_u}, D_q) = 1$ if $d_{q_u}$ was clicked in $D_q$ and 0
otherwise. $\mathrm{sim}(q, q_u)$ is the similarity between $q$ and $q_u$; in
this work we use the intersection over union metric applied
to query terms.

• Max similarity between $q_u$ and all queries in $C$ where
$d_{q_u}$ was clicked:

$$g_6 = \operatorname{argmax}\{\mathrm{sim}(q, q_u) \mid q \in C, \mathrm{clicked}(d_{q_u}, D_q) = 1\}$$


• Average similarity between $q_u$ and all queries in $C$
where $d_{q_u}$ was skipped (i.e. $d_{q_u}$ was not clicked but
there was at least one click below $d_{q_u}$):

$$g_7 = \frac{1}{\sum_{q \in C} \mathrm{skipped}(d_{q_u}, D_q)} \sum_{q \in C} \mathrm{skipped}(d_{q_u}, D_q) \, \mathrm{sim}(q, q_u)$$

where $\mathrm{skipped}(d_{q_u}, D_q) = 1$ if $d_{q_u}$ was skipped in $D_q$
and 0 otherwise.

• Max similarity between $q_u$ and all queries in $C$ where
$d_{q_u}$ was skipped:

$$g_8 = \operatorname{argmax}\{\mathrm{sim}(q, q_u) \mid q \in C, \mathrm{skipped}(d_{q_u}, D_q) = 1\}$$


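• Average similarity between $q_u$ and all queries in $C$
where $d_{q_u}$ was missed (i.e. all clicks were above $d_{q_u}$):

$$g_9 = \frac{1}{\sum_{q \in C} \mathrm{missed}(d_{q_u}, D_q)} \sum_{q \in C} \mathrm{missed}(d_{q_u}, D_q) \, \mathrm{sim}(q, q_u)$$

where $\mathrm{missed}(d_{q_u}, D_q) = 1$ if $d_{q_u}$ was missed in $D_q$ and
0 otherwise.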

• Max similarity between $q_u$ and all queries in $C$ where
$d_{q_u}$ was missed:

$$g_{10} = \operatorname{argmax}\{\mathrm{sim}(q, q_u) \mid q \in C, \mathrm{missed}(d_{q_u}, D_q) = 1\}$$

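• Number of times $d_{q_u}$ was shown, clicked, skipped and
missed in $C$:

$$g_{11} = \sum_{q \in C} I[d_{q_u} \in D_q]$$

$$g_{12} = \sum_{q \in C} \mathrm{clicked}(d_{q_u}, D_q)$$

$$g_{13} = \sum_{q \in C} \mathrm{skipped}(d_{q_u}, D_q)$$

$$g_{14} = \sum_{q \in C} \mathrm{missed}(d_{q_u}, D_q)$$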

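• Number of times $d_{q_u}$ was shown in $C$ discounted by
rank:

$$g_{15} = \sum_{q \in C} \frac{1}{\mathrm{r\_shown}(d_{q_u}, D_q)}$$

where $\mathrm{r\_shown}(d_{q_u}, D_q)$ is the rank of $d_{q_u}$ in $D_q$ if it was
shown and 0 otherwise. When $\mathrm{r\_shown}(d_{q_u}, D_q) = 0$
the ratio is set to 0.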

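• Number of times $d_{q_u}$ was clicked in $C$ discounted by
rank:

$$g_{16} = \sum_{q \in C} \frac{1}{\mathrm{r\_clicked}(d_{q_u}, D_q)}$$

where $\mathrm{r\_clicked}(d_{q_u}, D_q)$ is the rank of $d_{q_u}$ in $D_q$ if it was
clicked and 0 otherwise. When $\mathrm{r\_clicked}(d_{q_u}, D_q) = 0$
the ratio is set to 0.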

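• Max/min rank of $d_{q_u}$ when it was clicked in $C$:

$$g_{17} = \operatorname{argmax}\{\mathrm{r\_clicked}(d_{q_u}, D_q) \mid q \in C\}$$

$$g_{18} = \operatorname{argmin}\{\mathrm{r\_clicked}(d_{q_u}, D_q) \mid q \in C\}$$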

• Number of times $d_{q_u}$ was skipped in $C$ discounted by
rank:

$$g_{19} = \sum_{q \in C} \frac{1}{\mathrm{r\_skipped}(d_{q_u}, D_q)}$$

where $\mathrm{r\_skipped}(d_{q_u}, D_q)$ is the rank of $d_{q_u}$ in $D_q$ if it was
skipped and 0 otherwise. When $\mathrm{r\_skipped}(d_{q_u}, D_q) = 0$
the ratio is set to 0.

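• Number of times $d_{q_u}$ was missed in $C$ discounted by
rank:

$$g_{20} = \sum_{q \in C} \frac{1}{\mathrm{r\_missed}(d_{q_u}, D_q)}$$

where $\mathrm{r\_missed}(d_{q_u}, D_q)$ is the rank of $d_{q_u}$ in $D_q$ if it was
missed and 0 otherwise. When $\mathrm{r\_missed}(d_{q_u}, D_q) = 0$
the ratio is set to 0.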
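A few of the features above can be sketched in Python to show how the clicked/skipped/missed distinction and the query similarity fit together. All names and the context layout are hypothetical. One point the definitions leave open is how to treat a result list with no clicks at all; the sketch assumes such a document counts as shown only, neither skipped nor missed.

```python
def query_similarity(terms_a, terms_b):
    """Intersection over union of two sets of query term ids."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def doc_status(position, clicked_positions):
    """Classify a shown document by its 0-based position and the clicked positions."""
    if position in clicked_positions:
        return "clicked"
    if not clicked_positions:
        return "shown"                   # assumption: no clicks means shown only
    if position < max(clicked_positions):
        return "skipped"                 # not clicked, but a click occurred below it
    return "missed"                      # every click was above it

def click_features(doc, query_terms, context):
    """Compute g5, g11-g14 and g16 for `doc`.

    `context` is a list of (terms, docs, clicked_positions) triples, one per query.
    """
    counts = {"shown": 0, "clicked": 0, "skipped": 0, "missed": 0}
    clicked_sims, discounted_clicks = [], 0.0
    for terms, docs, clicked_positions in context:
        if doc not in docs:
            continue
        position = docs.index(doc)
        counts["shown"] += 1
        status = doc_status(position, clicked_positions)
        if status != "shown":
            counts[status] += 1
        if status == "clicked":
            clicked_sims.append(query_similarity(query_terms, terms))
            discounted_clicks += 1.0 / (position + 1)   # rank is position + 1
    g5 = sum(clicked_sims) / len(clicked_sims) if clicked_sims else 0.0
    return {"g5": g5, "g11": counts["shown"], "g12": counts["clicked"],
            "g13": counts["skipped"], "g14": counts["missed"], "g16": discounted_clicks}
```

The remaining features follow the same pattern: the skipped and missed variants replace the `status == "clicked"` test, and the max/min features keep running extremes instead of sums.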



